1 Introduction

The goal of this file is to arrange gene-centric Caenorhabditis RNA-seq data originally published as part of the modENCODE project or by the Waterston Lab at the University of Washington (Gerstein et al., 2010; Gerstein et al., 2014; Boeck et al., 2016; Warner et al., 2019).

1.1 Update Notes

2 Pre-processing Methods Overview

Each genome and GFF file was downloaded from WormBase.org version WS290. Reads were aligned to each genome using STAR (v2.7.6a, --alignIntronMax 30000 --alignMatesGapMax 30000) and the species-specific WS290 GTF file for each genome. PCR duplicates were removed using “seldup” (Warner et al., 2019). Read counts were obtained for each gene (CDS regions only, labeled “CDS” in C. briggsae and C. elegans and “coding_exon” in C. remanei, C. japonica, and C. brenneri) using featureCounts (from the subread-2.0.6 software package) with default settings. Only uniquely mapping reads were counted. Additionally, read counts were obtained for the CDS regions and for the full transcripts using the featureCounts options -M --fraction, so that multimappers were counted by splitting them equally among all of the locations where they aligned. Alignment and counting were performed by LaDeana Hillier (Waterston Lab, UW).
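The alignment and counting steps above can be sketched as shell commands. This is an illustrative reconstruction, not the lab's actual invocation: the file and index names are hypothetical placeholders, and only the flags named in the text are taken from the methods.

```shell
# Hypothetical paths; only the flags are drawn from the methods text above.
GTF=c_elegans.WS290.gtf     # species-specific WS290 annotation (placeholder name)
INDEX=star_index_elegans    # pre-built STAR genome index (placeholder name)

# STAR alignment with the stated intron and mate-gap limits
STAR --genomeDir "$INDEX" \
     --readFilesIn sample_R1.fastq sample_R2.fastq \
     --alignIntronMax 30000 --alignMatesGapMax 30000

# featureCounts over CDS features, default settings (unique mappers only)
featureCounts -a "$GTF" -t CDS -g gene_id -o counts.CDS.unique.txt sample.bam

# featureCounts with multimappers split fractionally across their alignments
featureCounts -M --fraction -a "$GTF" -t CDS -g gene_id -o counts.CDS.multi.txt sample.bam
```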

Read data for each species were imported into R and annotated with information from WormBase ParaSite BioMart.

Raw reads were quantified as counts per million (CPM) using the edgeR package, then filtered to remove genes with low counts (genes that did not exceed 1 CPM in at least 2 samples for C. elegans and C. briggsae, or in at least 1 sample for the remaining species). A list of discarded genes and their expression values across life stages was saved. The remaining gene values were normalized using the trimmed mean of M-values method (TMM; Robinson and Oshlack, 2010) to permit between-sample comparisons.
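As a minimal, self-contained sketch of this filter-then-normalize step (toy counts invented for illustration, not the project data; the cutoff shown is the two-sample version used for C. elegans and C. briggsae):

```r
library(edgeR)

# Toy count matrix: 4 genes x 3 samples (invented values)
counts <- matrix(c(100, 120,  90,
                     0,   1,   0,
                   500, 450, 600,
                    30,  25,  40),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))
dge <- DGEList(counts)

# Keep genes with > 1 CPM in at least 2 samples; set aside the rest
keepers <- rowSums(cpm(dge) > 1) >= 2
discarded <- dge[!keepers, ]
dge.filtered <- dge[keepers, ]

# TMM normalization to permit between-sample comparisons
dge.filtered.norm <- calcNormFactors(dge.filtered, method = "TMM")
log2cpm <- cpm(dge.filtered.norm, log = TRUE)
```

Here gene g2 (counted in only one sample) fails the two-sample cutoff and lands in `discarded`, mirroring how the discarded-gene matrix is assembled below.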

For C. briggsae and C. elegans, samples include sufficient biological replicates, so these samples were additionally pre-processed for differential gene expression analyses. The mean-variance relationship was modeled using a precision weights approach (Law et al., 2014) via the voom function in the limma package. This function produced a variance-stabilized EList object that includes precision weights for each gene to control for heteroscedasticity, as well as normalized expression values on the log2 scale (log2 counts-per-million).
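A minimal sketch of the voom step on invented counts (the two-group design here is hypothetical, not the project's study design):

```r
library(edgeR)
library(limma)

set.seed(1)
# Invented counts: 200 genes x 4 samples, two groups of 2 replicates
counts <- matrix(rnbinom(800, mu = 100, size = 10), ncol = 4,
                 dimnames = list(paste0("g", 1:200), paste0("s", 1:4)))
group <- factor(c("embryo", "embryo", "L1", "L1"))

dge <- calcNormFactors(DGEList(counts, group = group), method = "TMM")
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

# voom returns an EList: $E holds log2 CPM values, $weights holds the
# per-observation precision weights that control for heteroscedasticity
v <- voom(dge, design)
fit <- lmFit(v, design)   # ready for contrasts.fit() + eBayes() downstream
```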

For C. remanei, C. japonica, and C. brenneri, the lack of consistent biological replicates precludes limma::voom processing and downstream differential expression calculations. To enable app users to visualize gene count data, an EList object was manually constructed containing filtered, normalized log2 CPM values, gene annotation information, and sample information. This file can be loaded into the RNA-seq Browser for plotting.

For C. elegans embryonic timeline data, samples are from four previously published time series. Biological replicates are defined by a previously established inferred timeline that unified samples across these embryo time series (Boeck et al., 2016). Group names represent the average time for biological replicate pairs, which were selected to represent 80-minute developmental intervals. This timing was chosen to match previously published differential expression analyses.

This document saves multiple files that are passed to a Shiny Web App for downstream browsing and on-demand analysis. Note that these files are saved in an Outputs folder; to make them accessible to a local version of the Shiny browser, they need to be moved to the appropriate subfolders within the App folder: the www subfolder (for .csv files) or the Data subfolder (for R objects). Stable copies are already located within those folders and do not need to be replaced unless the pre-processing steps change.

3 Code

Note: Code chunks are collated and echoed at the end of the document in Appendix I.

3.1 Import data into R and generate a Digital Gene Expression List

Generate a digital gene expression list that can be easily shared and loaded for downstream filtering and normalization.

species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
# read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)

# load pre-generated annotation information
load(paste0("./Outputs/",x,"_geneAnnotations"))

# import featureCount output into R ----
if (x == 'elegans' | x == 'briggsae') {
  path <- paste0("./Data/", x, "/featureCount.C_", x,".", targets$Biological_ID, ".CDS.unique_only.ws290.txt")
} else {
  path <- paste0("./Data/", x, "/featureCount.C_", x,".", targets$Biological_ID, ".coding_exon.unique_only.ws290.txt")
}

featureCountData <- rbindlist(lapply(path, fread), idcol = "sample") %>%
  mutate(sample = targets$sample[sample])
colnames(featureCountData) <- c('sample', 'geneID', 'geneName', 'stableID', 'location', 'length', 'count')

featureCountData_wider <- featureCountData %>%
  dplyr::select(!c(location, length)) %>%
  pivot_wider(names_from = sample, values_from = count)

counts <- featureCountData_wider %>%
  dplyr::select(-c(stableID, geneName))%>%
  column_to_rownames(var = "geneID")

annotations_sub<-dplyr::select(featureCountData_wider, geneID) %>%
  left_join(annotations, by = "geneID") 

# generate a DGEList
myDGEList <- DGEList(counts,
                     samples = targets$sample,
                     group = targets$group, 
                     genes = annotations_sub)

# Save outputs
# Save myDGEList for downstream filtering
output.name <- paste0(x, '_DGEList')
save(myDGEList,
     file = file.path(output.path,
                      output.name))
}

3.2 Data Filtering and Normalization

The goal of this chunk is to:

  1. Filter and normalize data.
  2. Use ggplot2 to visualize the impact of filtering and normalization on the data.
  3. Save the following files and datasets:
    i) filtered and normalized (but not voom adjusted) log2CPM values (all species).
    ii) a matrix of discarded genes and their raw counts - this data is downloadable from within the Shiny Browser App (all species).

species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
# load pre-generated DGEList information
load(paste0("./Outputs/",x,"_DGEList"))

# load pre-generated annotation information
load(paste0("./Outputs/",x,"_geneAnnotations"))  
  
# read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)
# Generate life stage IDs
ids <- rep(cbind(targets$group), 
           times = nrow(myDGEList$counts)) %>%
  as_factor()

# calculate and plot log2 counts per million ----
# use the 'cpm' function from EdgeR to get log2 counts per million
# then coerce into a tibble
log2.cpm.df.pivot <-cpm(myDGEList, log=TRUE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample)) %>%
  pivot_longer(cols = -geneID, 
               names_to = "samples", 
               values_to = "expression") %>% 
  add_column(life_stage = ids)

# plot the data
p1 <- ggplot(log2.cpm.df.pivot) +
  aes(x=samples, y=expression, fill=life_stage) +
  geom_violin(trim = FALSE, show.legend = T, alpha= 0.7) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="log2 expression", x = "sample",
       title = paste0("C. ", x, ": Log2 Counts per Million (CPM)"),
       subtitle="unfiltered, non-normalized",
       fill= "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

# Filter the data ----

# filter genes/transcripts with low counts
# how many genes had more than 1 CPM (TRUE) in at least n samples
# Note: The cutoff "n" is adjusted for the number of 
# samples in the smallest group of comparison.
if (x == 'elegans' | x == 'briggsae') {
  keepers <- rowSums(cpm(myDGEList) > 1) >= 2
} else {
  keepers <- rowSums(cpm(myDGEList) > 1) >= 1
}

myDGEList.filtered <- myDGEList[keepers,]

ids.filtered <- rep(cbind(targets$group), 
                    times = nrow(myDGEList.filtered)) %>%
  as_factor()

log2.cpm.filtered.df.pivot <- cpm(myDGEList.filtered, log=TRUE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample)) %>%
  pivot_longer(cols = -geneID,
               names_to = "samples",
               values_to = "expression") %>%
  add_column(life_stage = ids.filtered) 

p2 <- ggplot(log2.cpm.filtered.df.pivot) +
  aes(x=samples, y=expression, fill=life_stage) +
  geom_violin(trim = FALSE, show.legend = T, alpha= 0.7) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="log2 expression", x = "sample",
       title = paste0("C. ", x, ": Log2 Counts per Million (CPM)"),
       subtitle="filtered, non-normalized",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

# Look at the genes excluded by the filtering step ----
# just to check that there aren't any with 
# high expression that are in few samples
# Discarded genes
myDGEList.discarded <- myDGEList[!keepers,]

ids.discarded <- rep(cbind(targets$group), 
                     times = nrow(myDGEList.discarded)) %>%
  as_factor()

# Note: log = FALSE here, so despite the object name these are linear CPM
# values, which lets the > 1 CPM check below apply directly
log2.cpm.discarded.df.pivot <- cpm(myDGEList.discarded, log = FALSE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample)) %>%
  pivot_longer(cols = -geneID,
               names_to = "samples",
               values_to = "expression") %>%
  add_column(life_stage = ids.discarded)

# Genes that are above 1 cpm
log2.cpm.discarded.df.pivot %>%
  dplyr::filter(expression > 1)

# Generate a matrix of discarded genes and their raw counts ----
discarded.gene.df <- log2.cpm.discarded.df.pivot %>%
  pivot_wider(names_from = c(life_stage, samples), 
              names_sep = "-", 
              values_from = expression, 
              id_cols = geneID)%>%
  left_join(annotations, by = "geneID") 


# Plot discarded genes
p.discarded <- ggplot(log2.cpm.discarded.df.pivot) +
  aes(x=samples, y=expression, color=life_stage) +
  geom_jitter(alpha = 0.3, show.legend = T)+
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="expression", x = "sample",
       title = paste0("C. ", x, ": Counts per Million (CPM)"),
       subtitle="genes excluded by low count filtering step, non-normalized",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

  
# Normalize the data using a between samples normalization ----
# Source for TMM sample normalization here:
# https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25
myDGEList.filtered.norm <- calcNormFactors(myDGEList.filtered, method = "TMM")

log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE) 

log2.cpm.filtered.norm.df <- log2.cpm.filtered.norm %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample))

log2.cpm.filtered.norm.df.pivot<-log2.cpm.filtered.norm.df %>%
  pivot_longer(cols = -geneID,
               names_to = "samples",
               values_to = "expression") %>%
  add_column(life_stage = ids.filtered)

p3 <- ggplot(log2.cpm.filtered.norm.df.pivot) +
  aes(x=samples, y=expression, fill=life_stage) +
  geom_violin(trim = FALSE, show.legend = T, alpha = 0.7) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="log2 expression", x = "sample",
       title = paste0("C. ", x, ": Log2 Counts per Million (CPM)"),
       subtitle="filtered, TMM normalized",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

# Save outputs
output.name <- paste0(x, '_FilteringNormalizationGraphs')
save(p1, p2, p3, p.discarded, discarded.gene.df,
     file = file.path(output.path,
                      output.name))

output.name <- paste0(x, '_FilteredNormalizedData')
save(myDGEList.filtered.norm, log2.cpm.filtered.norm,
     file = file.path(output.path,
                      output.name))

# Save a matrix of discarded genes and their raw counts ----
output.name <- paste0(x, '_discardedGenes.csv')
discarded.gene.df %>%    
write.csv(file = file.path(output.path,
                           output.name))

}

3.3 Variance Stabilization and EList Generation

For C. elegans and C. briggsae, the goal of this chunk is to fit the data to a linear model for detecting differentially expressed genes (DEGs) using limma::voom. For C. remanei, C. japonica, and C. brenneri, the goal is to generate an EList object shaped like the one produced by limma::voom, but containing filtered and normalized (but not voom adjusted) log2CPM values as well as annotation and study design information.

This chunk saves either the variance-stabilized, filtered and normalized data (C. elegans and C. briggsae) or the filtered and normalized data (C. remanei, C. japonica, and C. brenneri) as an R object with the suffix _DGEList. It also saves a matrix of processed count data as a .csv file for downloading from the App.

species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
# load pre-generated filtered and normalized data
load(paste0("./Outputs/",x,"_FilteredNormalizedData"))

# read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)

if (x == 'elegans' | x == 'briggsae'){
# Compute Variance-Stabilized DGEList Object ----

# Set up the design matrix ----
# no intercept/blocking for matrix, comparisons across group
  group <- factor(targets$group)
  design <- model.matrix(~0 + group) 
  colnames(design) <- levels(group)

# Model mean-variance trend and fit linear model to data ----
# Use the voom function from the limma package to model the mean-variance relationship.
# voom produces a variance-stabilized EList that includes precision
# weights for each gene to try and control for heteroscedasticity.
# It transforms count data to log2 counts per million.
# Outputs: E = normalized expression values on the log2 scale
  # Note: this object is being called 'DGEList.filtered.norm' instead of 
  # 'v.DGEList.filtered.norm' to make it easier to work with non-variance stabilized data
  # from other species. 
DGEList.filtered.norm <- voom(counts = myDGEList.filtered.norm, 
                                design = design, plot = T)
colnames(DGEList.filtered.norm)<-targets$sample
colnames(DGEList.filtered.norm$E) <- paste(targets$group, 
                                             targets$uniqueness_suffix,sep = '-')
# Save matrix of genes and their filtered, normalized, voom-transformed counts in Shiny app download (www) folder ----
# This is the count data that underlies the differential expression analyses in the Shiny app. 

output.name <- paste0(x, '_log2cpm_filtered_norm_voom.csv')
write.csv(DGEList.filtered.norm$E, 
          file = file.path(www.path,
                           output.name))

# Save variance-stabilized DGEList in Shiny app Data folder----
output.name <- paste0(x, '_DGEList')
save(DGEList.filtered.norm,
     file = file.path(app.path,
                      output.name))
} else{
  DGEList.filtered.norm <- new("EList",
                               list(genes = myDGEList.filtered.norm$genes,
                                    targets = myDGEList.filtered.norm$samples,
                                    E = log2.cpm.filtered.norm))
  colnames(DGEList.filtered.norm)<-targets$sample
colnames(DGEList.filtered.norm$E) <- paste(targets$group, 
                                             targets$uniqueness_suffix,sep = '-')
# Save matrix of genes and their filtered, normalized counts in Shiny app download (www) folder----
output.name <- paste0(x, '_log2cpm_filtered_norm.csv')
write.csv(DGEList.filtered.norm$E, 
          file = file.path(www.path,
                           output.name))

# Save DGEList in Shiny app Data folder----
output.name <- paste0(x, '_DGEList')
save(DGEList.filtered.norm,
     file = file.path(app.path,
                      output.name))
}

}

3.4 Multivariate Analysis

The goal of this chunk is to implement hierarchical clustering and PCA, plotting a dendrogram of the hierarchical clustering, a PCA plot for the first two principal components, and a PCA small-multiples plot. These analyses are performed on either variance-stabilized, filtered and normalized data (C. elegans, C. briggsae) or filtered and normalized data (C. brenneri, C. remanei, C. japonica).

species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
  # read in DGElist object
  output.name <- paste0(x, '_DGEList')
  load(file = file.path(app.path, output.name))
  
  # read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)
# Perform multivariate analysis on the data
# Identify variables of interest in study design file ----
group <- factor(targets$group)

# Hierarchical clustering ---------------
# Remember: hierarchical clustering can only work on a data matrix, not a data frame

# Calculate distance matrix
# dist calculates distance between rows, so transpose data so that we get distance between samples.
# how similar are samples from each other
distance <- dist(t(DGEList.filtered.norm$E), method = "maximum") # other distance methods are "euclidean", "manhattan", "canberra", "binary", or "minkowski"

# Calculate clusters to visualize differences. This is the hierarchical clustering.
# The methods here include: single (i.e. "friends-of-friends"), complete (i.e. complete linkage), and average (i.e. UPGMA). Here's a comparison of different types: https://en.wikipedia.org/wiki/UPGMA#Comparison_with_other_linkages
clusters <- hclust(distance, method = "complete") #other agglomeration methods are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"
dend <- as.dendrogram(clusters) 

p4<-dend %>% 
  dendextend::set("hang_leaves", c(0.1)) %>% 
  dendextend::set("labels_cex", c(0.5)) %>%
  dendextend::set("branches_lwd", c(0.7)) %>% 
  as.ggdend %>%
  ggplot (offset_labels = -0.5) +
  theme_dendro() +
  ylim(-8, max(get_branches_heights(dend))) +
  labs(title = "Hierarchical Cluster Dendrogram ",
       subtitle = "filtered, TMM normalized",
       y = "Distance",
       x = "Life stage") +
  coord_fixed(1/2) +
  theme(axis.title.x = element_text(color = "black"),
        axis.title.y = element_text(angle = 90),
        axis.text.y = element_text(angle = 0),
        axis.line.y = element_line(color = "black"),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.length.y = unit(2, "mm"))

# Principal component analysis (PCA) -------------
# this also works on a data matrix, not a data frame
pca.res <- prcomp(t(DGEList.filtered.norm$E), scale.=F, retx=T)
summary(pca.res) # Prints variance summary for all principal components.

#pca.res$rotation # 'rotation' shows how much each gene contributes to each PC (the gene loadings)
#pca.res$x # 'x' shows where each sample falls along each PC (the sample scores)
#note that these have a magnitude and a direction (this is the basis for making a PCA plot)
## This generates a scree plot: a standard way to view the eigenvalue of each principal component, showing the proportion of variance accounted for by each PC. Plotting only the first 10 dimensions.
p5<-fviz_eig(pca.res,
             #barcolor = brewer.pal(8,"Pastel2")[8],
             #barfill = brewer.pal(8,"Pastel2")[8],
             linecolor = "black",
             main = "Scree plot: proportion of variance accounted for by each principal component", ggtheme = theme_bw()) 

pc.var<-pca.res$sdev^2 # sdev^2 captures these eigenvalues from the PCA result
pc.per<-round(pc.var/sum(pc.var)*100, 1) # we can then use these eigenvalues to calculate the percentage variance explained by each PC

# Visualize the PCA result ------------------
# let's first plot any two PCs against each other
# We know where each sample falls along each PC (its scores), so let's plot
pca.res.df <- as_tibble(pca.res$x[,1:2]) %>% 
  add_column(group = targets$group)
# Plotting PC1 and PC2

p6<-ggplot(pca.res.df) +
  aes(x=PC1, y=PC2,
      label = group,
      fill = group,
      color = group
  ) +
  geom_point(size=4) +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) + 
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title="Principal Components Analysis",
       subtitle = "Note: analysis is blind to sample identity.",
       fill = "Life stage",
       color = "Life stage") +
  scale_x_continuous(expand = c(.3, .3)) +
  scale_y_continuous(expand = c(.3, .3)) +
  coord_fixed() +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10))

# Create a PCA 'small multiples' chart ----
if (ncol(pca.res$x) > 5) {
  pca.res.df <- pca.res$x[,1:6]
} else {
  pca.res.df <- pca.res$x[,1:ncol(pca.res$x)]
}
 
pca.res.df <- pca.res.df %>% 
  as_tibble() %>%
  add_column(sample = targets$sample,
             group = targets$group)

pca.pivot <- pivot_longer(pca.res.df, # data frame to be pivoted
                          cols = PC1:PC3, # column names to be stored as a SINGLE variable
                          names_to = "PC", # name of that new variable (column)
                          values_to = "loadings") # name of new variable (column) storing all the values (data)
PC1<-subset(pca.pivot, PC == "PC1")
PC2 <-subset(pca.pivot, PC == "PC2")
PC3 <- subset(pca.pivot, PC == "PC3")
#PC4 <- subset(pca.pivot, PC == "PC4")

# New facet label names for PCs
PC.labs <- c(paste0("PC1 (",pc.per[1],"%",")"),
             paste0("PC2 (",pc.per[2],"%",")"),
             paste0("PC3 (",pc.per[3],"%",")")
)
names(PC.labs) <- c("PC1", "PC2", "PC3")

p7<-ggplot(pca.pivot) +
  aes(x=sample, y=loadings) + # you could iteratively 'paint' different co-variates onto this plot using the 'fill' aes
  geom_bar(stat="identity") +
  facet_wrap(~PC, labeller = labeller(PC = PC.labs)) +
  geom_bar(data = PC1, stat = "identity", aes(fill = group)) +
  geom_bar(data = PC2, stat = "identity", aes(fill = group)) +
  geom_bar(data = PC3, stat = "identity", aes(fill = group)) +
  labs(title="PCA 'small multiples' plot",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  scale_x_discrete(limits = targets$sample, labels = targets$group) +
  theme_bw() +
  theme(text = element_text(size = 5),
        title = element_text(size = 5))+
  coord_flip()


output.name <- paste0(x, '_PCAGraphs')
save(p4, p5, p6, p7,
     file = file.path(output.path,
                      output.name))
}

4 Filtering and Normalization Results

4.1 C. elegans Filtering Results

4.1.1 Unfiltered, non-normalized log2CPM data

4.1.2 Filtered, non-normalized log2CPM data

4.1.3 Filtered, normalized log2CPM data by life stage

4.1.4 Genes discarded by low-copy filtering step

4.1.5 Discarded gene list

4.2 C. briggsae Filtering Results

4.2.1 Unfiltered, non-normalized log2CPM data

4.2.2 Filtered, non-normalized log2CPM data

4.2.3 Filtered, normalized log2CPM data by life stage

4.2.4 Genes discarded by low-copy filtering step

4.2.5 Discarded gene list

4.3 C. brenneri Filtering Results

4.3.1 Unfiltered, non-normalized log2CPM data

4.3.2 Filtered, non-normalized log2CPM data

4.3.3 Filtered, normalized log2CPM data by life stage

4.3.4 Genes discarded by low-copy filtering step

4.3.5 Discarded gene list

4.4 C. remanei Filtering Results

4.4.1 Unfiltered, non-normalized log2CPM data

4.4.2 Filtered, non-normalized log2CPM data

4.4.3 Filtered, normalized log2CPM data by life stage

4.4.4 Genes discarded by low-copy filtering step

4.4.5 Discarded gene list

4.5 C. japonica Filtering Results

4.5.1 Unfiltered, non-normalized log2CPM data

4.5.2 Filtered, non-normalized log2CPM data

4.5.3 Filtered, normalized log2CPM data by life stage

4.5.4 Genes discarded by low-copy filtering step

4.5.5 Discarded gene list

5 Multivariate Analysis Results

5.1 C. elegans PCA Results

5.1.1 Dendrogram to visualize hierarchical clustering

Clustering performed on filtered, normalized and variance-stabilized abundance data using the “complete” method.

5.1.2 Screeplot of PCA Eigenvalues

A scree plot is a standard way to view the eigenvalue of each principal component. The plot shows the proportion of variance accounted for by each PC.

5.1.3 PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.

5.1.4 PCA “Small Multiples” Plot

5.2 C. briggsae PCA Results

5.2.1 Dendrogram to visualize hierarchical clustering

Clustering performed on filtered, normalized and variance-stabilized abundance data using the “complete” method.

5.2.2 Screeplot of PCA Eigenvalues

A scree plot is a standard way to view the eigenvalue of each principal component. The plot shows the proportion of variance accounted for by each PC.

5.2.3 PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.

5.2.4 PCA “Small Multiples” Plot

5.3 C. brenneri PCA Results

5.3.1 Dendrogram to visualize hierarchical clustering

Clustering performed on filtered and normalized abundance data using the “complete” method.

5.3.2 Screeplot of PCA Eigenvalues

A scree plot is a standard way to view the eigenvalue of each principal component. The plot shows the proportion of variance accounted for by each PC.

5.3.3 PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.

5.3.4 PCA “Small Multiples” Plot

5.4 C. remanei PCA Results

5.4.1 Dendrogram to visualize hierarchical clustering

Clustering performed on filtered and normalized abundance data using the “complete” method.

5.4.2 Screeplot of PCA Eigenvalues

A scree plot is a standard way to view the eigenvalue of each principal component. The plot shows the proportion of variance accounted for by each PC.

5.4.3 PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.

5.4.4 PCA “Small Multiples” Plot

5.5 C. japonica PCA Results

5.5.1 Dendrogram to visualize hierarchical clustering

Clustering performed on filtered and normalized abundance data using the “complete” method.

5.5.2 Screeplot of PCA Eigenvalues

A scree plot is a standard way to view the eigenvalue of each principal component. The plot shows the proportion of variance accounted for by each PC.

5.5.3 PCA Plot

Plot of the samples in PCA space. Fill color indicates life stage.

5.5.4 PCA “Small Multiples” Plot

6 Appendix I : All code for this report

knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE,
    collapse = TRUE
)
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!require("tximport", quietly = TRUE)) BiocManager::install("tximport", ask = FALSE)
if (!require("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load("tidyverse","data.table", "magrittr","edgeR","matrixStats","cowplot","ggthemes","gprofiler2","limma","tximport", "knitr", "ggforce", "ggdendro", "factoextra", "gridExtra", "dendextend")

# Check for presence of output folder, generate if it doesn't exist
output.path <- "./Outputs"
if (!dir.exists(output.path)){
  dir.create(output.path)
}

app.path <-"../Data"
www.path <-"../www"
species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
# read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)

# load pre-generated annotation information
load(paste0("./Outputs/",x,"_geneAnnotations"))

# import featureCount output into R ----
if (x == 'elegans' | x == 'briggsae') {
  path <- paste0("./Data/", x, "/featureCount.C_", x,".", targets$Biological_ID, ".CDS.unique_only.ws290.txt")
} else {
  path <- paste0("./Data/", x, "/featureCount.C_", x,".", targets$Biological_ID, ".coding_exon.unique_only.ws290.txt")
}

featureCountData <- rbindlist(lapply(path, fread), idcol = "sample") %>%
  mutate(sample = targets$sample[sample])
colnames(featureCountData) <- c('sample', 'geneID', 'geneName', 'stableID', 'location', 'length', 'count')

featureCountData_wider <- featureCountData %>%
  dplyr::select(!c(location, length)) %>%
  pivot_wider(names_from = sample, values_from = count)

counts <- featureCountData_wider %>%
  dplyr::select(-c(stableID, geneName))%>%
  column_to_rownames(var = "geneID")

annotations_sub<-dplyr::select(featureCountData_wider, geneID) %>%
  left_join(annotations, by = "geneID") 

# generate a DGEList
myDGEList <- DGEList(counts,
                     samples = targets$sample,
                     group = targets$group, 
                     genes = annotations_sub)

# Save outputs
# Save myDGEList for downstream filtering
output.name <- paste0(x, '_DGEList')
save(myDGEList,
     file = file.path(output.path,
                      output.name))
}
species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
# load pre-generated DGEList information
load(paste0("./Outputs/",x,"_DGEList"))

# load pre-generated annotation information
load(paste0("./Outputs/",x,"_geneAnnotations"))  
  
# read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)
# Generate life stage IDs
ids <- rep(cbind(targets$group), 
           times = nrow(myDGEList$counts)) %>%
  as_factor()

# calculate and plot log2 counts per million ----
# use the 'cpm' function from EdgeR to get log2 counts per million
# then coerce into a tibble
log2.cpm.df.pivot <-cpm(myDGEList, log=TRUE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample)) %>%
  pivot_longer(cols = -geneID, 
               names_to = "samples", 
               values_to = "expression") %>% 
  add_column(life_stage = ids)

# plot the data
p1 <- ggplot(log2.cpm.df.pivot) +
  aes(x=samples, y=expression, fill=life_stage) +
  geom_violin(trim = FALSE, show.legend = T, alpha= 0.7) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="log2 expression", x = "sample",
       title = paste0("C. ", x, ": Log2 Counts per Million (CPM)"),
       subtitle="unfiltered, non-normalized",
       fill= "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

# Filter the data ----

# filter genes/transcripts with low counts
# how many genes had more than 1 CPM (TRUE) in at least n samples
# Note: The cutoff "n" is adjusted for the number of 
# samples in the smallest group of comparison.
if (x == 'elegans' | x == 'briggsae') {
  keepers <- rowSums(cpm(myDGEList) > 1) >= 2
} else {
  keepers <- rowSums(cpm(myDGEList) > 1) >= 1
}
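The keep/drop logic can be checked on a tiny made-up CPM matrix: a gene passes only if it exceeds 1 CPM in at least n individual samples, so a gene that is highly expressed in a single sample still fails the two-sample cutoff.

```r
# Toy illustration (CPM values invented): keep genes with > 1 CPM in >= 2 samples
toy.cpm <- matrix(c(5,   3,   0,    # above 1 CPM in two samples -> keep
                    0.2, 0.4, 0.9,  # never above 1 CPM -> drop
                    40,  0,   0),   # above 1 CPM in only one sample -> drop
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("geneA", "geneB", "geneC"),
                                  c("s1", "s2", "s3")))
toy.keep <- rowSums(toy.cpm > 1) >= 2
toy.keep    # geneA TRUE, geneB FALSE, geneC FALSE
```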

myDGEList.filtered <- myDGEList[keepers,]

ids.filtered <- rep(targets$group,
                    times = nrow(myDGEList.filtered)) %>%
  as_factor()

log2.cpm.filtered.df.pivot <- cpm(myDGEList.filtered, log=TRUE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample)) %>%
  pivot_longer(cols = -geneID,
               names_to = "samples",
               values_to = "expression") %>%
  add_column(life_stage = ids.filtered) 

p2 <- ggplot(log2.cpm.filtered.df.pivot) +
  aes(x=samples, y=expression, fill=life_stage) +
  geom_violin(trim = FALSE, show.legend = T, alpha= 0.7) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="log2 expression", x = "sample",
       title = paste0("C. ", x, ": Log2 Counts per Million (CPM)"),
       subtitle="filtered, non-normalized",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

# Look at the genes excluded by the filtering step ----
# just to check that there aren't any with 
# high expression that are in few samples
# Discarded genes
myDGEList.discarded <- myDGEList[!keepers,]

ids.discarded <- rep(targets$group,
                     times = nrow(myDGEList.discarded)) %>%
  as_factor()

# Note: these are (non-log) CPM values
cpm.discarded.df.pivot <- cpm(myDGEList.discarded, log = FALSE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample)) %>%
  pivot_longer(cols = -geneID,
               names_to = "samples",
               values_to = "expression") %>%
  add_column(life_stage = ids.discarded)

# Discarded genes that are nonetheless above 1 CPM in some sample
cpm.discarded.df.pivot %>%
  dplyr::filter(expression > 1)

# Generate a matrix of discarded genes and their CPM values ----
discarded.gene.df <- cpm.discarded.df.pivot %>%
  pivot_wider(names_from = c(life_stage, samples),
              names_sep = "-",
              values_from = expression,
              id_cols = geneID) %>%
  left_join(annotations, by = "geneID")


# Plot discarded genes
p.discarded <- ggplot(cpm.discarded.df.pivot) +
  aes(x = samples, y = expression, color = life_stage) +
  geom_jitter(alpha = 0.3, show.legend = TRUE) +
  stat_summary(fun = "median",
               geom = "point",
               shape = 20,
               size = 2,
               color = "black",
               show.legend = FALSE) +
  labs(y = "expression (CPM)", x = "sample",
       title = paste0("C. ", x, ": Counts per Million (CPM)"),
       subtitle = "genes excluded by low count filtering step, non-normalized",
       color = "Life stage",
       caption = paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text(size = 5)) +
  coord_flip()

  
# Normalize the data using a between samples normalization ----
# Source for TMM sample normalization here:
# https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25
myDGEList.filtered.norm <- calcNormFactors(myDGEList.filtered, method = "TMM")
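For intuition about what `calcNormFactors` is doing, the core TMM idea can be sketched in base R: compute per-gene log fold changes (M-values) of one sample against a reference, trim the extremes, and turn the mean of the remainder into a scaling factor. This `tmm_sketch` is a deliberate simplification, not a reimplementation: the real edgeR routine also trims on absolute intensity (A-values), weights genes by precision, and picks the reference sample automatically.

```r
# Simplified sketch of the TMM idea (illustration only; see edgeR's
# calcNormFactors and Robinson & Oshlack 2010 for the full method)
tmm_sketch <- function(obs, ref, logratio.trim = 0.3) {
  keep <- obs > 0 & ref > 0                     # M-values are undefined at zero counts
  M <- log2((obs[keep] / sum(obs)) / (ref[keep] / sum(ref)))  # per-gene log ratios
  lo <- quantile(M, logratio.trim)
  hi <- quantile(M, 1 - logratio.trim)
  2^mean(M[M >= lo & M <= hi])                  # trimmed mean of M -> scaling factor
}

# If two libraries differ only in sequencing depth (no composition bias),
# every M-value is 0 and the scaling factor is 1
tmm_sketch(c(10, 20, 30, 40), c(20, 40, 60, 80))
```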

log2.cpm.filtered.norm <- cpm(myDGEList.filtered.norm, log=TRUE) 

log2.cpm.filtered.norm.df<- cpm(myDGEList.filtered.norm, log=TRUE) %>%
  as_tibble(rownames = "geneID") %>%
  setNames(nm = c("geneID", targets$sample))

log2.cpm.filtered.norm.df.pivot<-log2.cpm.filtered.norm.df %>%
  pivot_longer(cols = -geneID,
               names_to = "samples",
               values_to = "expression") %>%
  add_column(life_stage = ids.filtered)

p3 <- ggplot(log2.cpm.filtered.norm.df.pivot) +
  aes(x=samples, y=expression, fill=life_stage) +
  geom_violin(trim = FALSE, show.legend = T, alpha = 0.7) +
  stat_summary(fun = "median", 
               geom = "point", 
               shape = 20, 
               size = 2, 
               color = "black", 
               show.legend = FALSE) +
  labs(y="log2 expression", x = "sample",
       title = paste0("C. ", x, ": Log2 Counts per Million (CPM)"),
       subtitle="filtered, TMM normalized",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10),
        axis.text.y = element_text (size = 5)) +
  coord_flip()

# Save outputs
output.name <- paste0(x, '_FilteringNormalizationGraphs')
save(p1, p2, p3, p.discarded, discarded.gene.df,
     file = file.path(output.path,
                      output.name))

output.name <- paste0(x, '_FilteredNormalizedData')
save(myDGEList.filtered.norm, log2.cpm.filtered.norm,
     file = file.path(output.path,
                      output.name))

# Save a matrix of discarded genes and their CPM values ----
output.name <- paste0(x, '_discardedGenes.csv')
discarded.gene.df %>%    
write.csv(file = file.path(output.path,
                           output.name))

}
species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
# load pre-generated filtered and normalized data
load(paste0("./Outputs/",x,"_FilteredNormalizedData"))

# read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)

if (x == 'elegans' | x == 'briggsae'){
# Compute Variance-Stabilized DGEList Object ----

# Set up the design matrix ----
# no intercept/blocking for matrix, comparisons across group
  group <- factor(targets$group)
  design <- model.matrix(~0 + group) 
  colnames(design) <- levels(group)
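As a reminder of what this cell-means parameterization looks like, a toy factor (invented levels) shows that `~0 + group` yields one indicator column per life stage and no intercept column, so each fitted coefficient is simply that stage's mean expression.

```r
# Toy illustration (invented levels): a no-intercept design has one 0/1
# indicator column per factor level; each row marks its sample's group
toy.group <- factor(c("L1", "L1", "L2", "L2"))
toy.design <- model.matrix(~0 + toy.group)
colnames(toy.design) <- levels(toy.group)
toy.design   # each row sums to 1; each column counts its group's samples
```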

# Model mean-variance trend and fit linear model to data ----
# Use the voom function from the limma package to model the mean-variance relationship.
# This produces a variance-stabilized EList that includes precision
# weights for each gene to control for heteroscedasticity, and
# transforms the count data to log2 counts per million.
# Outputs: E = normalized expression values on the log2 scale
  # Note: this object is called 'DGEList.filtered.norm' instead of
  # 'v.DGEList.filtered.norm' to make it easier to work with non-variance-stabilized
  # data from other species.
DGEList.filtered.norm <- voom(counts = myDGEList.filtered.norm,
                              design = design, plot = TRUE)
colnames(DGEList.filtered.norm) <- targets$sample
colnames(DGEList.filtered.norm$E) <- paste(targets$group,
                                           targets$uniqueness_suffix, sep = '-')
# Save matrix of genes and their filtered, normalized, voom-transformed counts in Shiny app download (www) folder ----
# This is the count data that underlies the differential expression analyses in the Shiny app. 

output.name <- paste0(x, '_log2cpm_filtered_norm_voom.csv')
write.csv(DGEList.filtered.norm$E, 
          file = file.path(www.path,
                           output.name))

# Save variance-stabilized DGEList in Shiny app Data folder----
output.name <- paste0(x, '_DGEList')
save(DGEList.filtered.norm,
     file = file.path(app.path,
                      output.name))
} else {
  DGEList.filtered.norm <- new("EList", list(genes = myDGEList.filtered.norm$genes,
                                             targets = myDGEList.filtered.norm$samples,
                                             E = log2.cpm.filtered.norm))
  colnames(DGEList.filtered.norm) <- targets$sample
  colnames(DGEList.filtered.norm$E) <- paste(targets$group,
                                             targets$uniqueness_suffix, sep = '-')
# Save matrix of genes and their filtered, normalized counts in Shiny app download (www) folder----
output.name <- paste0(x, '_log2cpm_filtered_norm.csv')
write.csv(DGEList.filtered.norm$E, 
          file = file.path(www.path,
                           output.name))

# Save DGEList in Shiny app Data folder----
output.name <- paste0(x, '_DGEList')
save(DGEList.filtered.norm,
     file = file.path(app.path,
                      output.name))
}

}
species_list <- tibble(species = c('elegans', 'briggsae', 'brenneri', 'remanei', 'japonica'))
for (x in species_list$species) {
  # read in DGElist object
  output.name <- paste0(x, '_DGEList')
  load(file = file.path(app.path, output.name))
  
  # read in the study design ----
targets <- read_tsv(paste0("./Data/", x, "/", x,"_study_design.txt"),
                    na = c("", "NA", "na"), show_col_types = F)
# Perform multivariate analysis on the data
# Identify variables of interest in study design file ----
group <- factor(targets$group)

# Hierarchical clustering ---------------
# Remember: hierarchical clustering can only work on a data matrix, not a data frame

# Calculate distance matrix
# dist calculates distance between rows, so transpose data so that we get distance between samples.
# how similar are samples from each other
distance <- dist(t(DGEList.filtered.norm$E), method = "maximum") # other distance methods are "euclidean", "maximum", "manhattan", "canberra", "binary", or "minkowski"

# Calculate clusters to visualize differences. This is the hierarchical clustering.
# The methods here include: single (i.e. "friends-of-friends"), complete (i.e. complete linkage), and average (i.e. UPGMA). Here's a comparison of different types: https://en.wikipedia.org/wiki/UPGMA#Comparison_with_other_linkages
clusters <- hclust(distance, method = "complete") #other agglomeration methods are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", or "centroid"
dend <- as.dendrogram(clusters) 
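The transpose step above matters: `dist` measures distances between rows, so samples must be made the rows before clustering. A toy example (random values, invented dimensions) shows the bookkeeping:

```r
# Toy illustration (random values): dist() works on rows, so the
# genes-by-samples matrix is transposed to get sample-to-sample distances
toy.expr <- matrix(rnorm(12), nrow = 4,
                   dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))
toy.dist <- dist(t(toy.expr), method = "maximum")  # 3 samples -> 3 pairwise distances
toy.clust <- hclust(toy.dist, method = "complete") # complete linkage, as above
toy.clust$labels                                   # sample names, not gene names
```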

p4 <- dend %>%
  dendextend::set("hang_leaves", 0.1) %>%
  dendextend::set("labels_cex", 0.5) %>%
  dendextend::set("branches_lwd", 0.7) %>%
  as.ggdend() %>%
  ggplot(offset_labels = -0.5) +
  theme_dendro() +
  ylim(-8, max(get_branches_heights(dend))) +
  labs(title = "Hierarchical Cluster Dendrogram",
       subtitle = "filtered, TMM normalized",
       y = "Distance",
       x = "Life stage") +
  coord_fixed(1/2) +
  theme(axis.title.x = element_text(color = "black"),
        axis.title.y = element_text(angle = 90),
        axis.text.y = element_text(angle = 0),
        axis.line.y = element_line(color = "black"),
        axis.ticks.y = element_line(color = "black"),
        axis.ticks.length.y = unit(2, "mm"))

# Principal component analysis (PCA) -------------
# this also works on a data matrix, not a data frame
pca.res <- prcomp(t(DGEList.filtered.norm$E), scale. = FALSE, retx = TRUE)
summary(pca.res) # Prints variance summary for all principal components.

#pca.res$rotation # 'rotation' gives each gene's contribution to each PC (the gene loadings)
#pca.res$x # 'x' gives each sample's coordinates on each PC (the sample scores)
#note that these have a magnitude and a direction (this is the basis for making a PCA plot)
## This generates a scree plot: a standard way to view the eigenvalues of a PCA.
## It shows the proportion of variance accounted for by each PC (only the first 10 PCs are plotted).
p5 <- fviz_eig(pca.res,
               #barcolor = brewer.pal(8, "Pastel2")[8],
               #barfill = brewer.pal(8, "Pastel2")[8],
               linecolor = "black",
               main = "Scree plot: proportion of variance accounted for by each principal component", ggtheme = theme_bw())

pc.var <- pca.res$sdev^2 # sdev^2 gives the eigenvalue of each PC
pc.per <- round(pc.var / sum(pc.var) * 100, 1) # percentage of variance explained by each PC
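The percent-variance calculation can be verified directly on toy data: the squared standard deviations (`sdev^2`) are the PC eigenvalues, and because the PCs together account for all of the variance, the unrounded percentages always sum to exactly 100.

```r
# Toy illustration (random data): the eigenvalues (sdev^2) partition the
# total variance, so the per-PC proportions computed above sum to 1
toy.pca <- prcomp(matrix(rnorm(50), nrow = 10), scale. = FALSE)
toy.var <- toy.pca$sdev^2                 # eigenvalues, largest first
toy.per <- round(toy.var / sum(toy.var) * 100, 1)
sum(toy.var / sum(toy.var)) * 100         # exactly 100 before rounding
```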

# Visualize the PCA result ------------------
# let's first plot any two PCs against each other
# We know each sample's coordinates on each PC (scores), so let's plot
pca.res.df <- as_tibble(pca.res$x[,1:2]) %>% 
  add_column(group = targets$group)
# Plotting PC1 and PC2

p6<-ggplot(pca.res.df) +
  aes(x=PC1, y=PC2,
      label = group,
      fill = group,
      color = group
  ) +
  geom_point(size=4) +
  xlab(paste0("PC1 (",pc.per[1],"%",")")) + 
  ylab(paste0("PC2 (",pc.per[2],"%",")")) +
  labs(title = "Principal Components Analysis",
       subtitle = "Note: analysis is blind to sample identity.",
       fill = "Life stage",
       color = "Life stage") +
  scale_x_continuous(expand = c(.3, .3)) +
  scale_y_continuous(expand = c(.3, .3)) +
  coord_fixed() +
  theme_bw() +
  theme(text = element_text(size = 10),
        title = element_text(size = 10))

# Create a PCA 'small multiples' chart ----
if (ncol(pca.res$x) > 5) {
  pca.res.df <- pca.res$x[, 1:6]
} else {
  pca.res.df <- pca.res$x[, 1:ncol(pca.res$x)]
}
 
pca.res.df <- pca.res.df %>% 
  as_tibble() %>%
  add_column(sample = targets$sample,
             group = targets$group)

pca.pivot <- pivot_longer(pca.res.df, # data frame to be pivoted
                          cols = PC1:PC3, # column names to be stored as a SINGLE variable
                          names_to = "PC", # name of that new variable (column)
                          values_to = "loadings") # name of new variable (column) storing all the values (data)
# New facet label names for PCs
PC.labs <- c(paste0("PC1 (", pc.per[1], "%)"),
             paste0("PC2 (", pc.per[2], "%)"),
             paste0("PC3 (", pc.per[3], "%)"))
names(PC.labs) <- c("PC1", "PC2", "PC3")

p7 <- ggplot(pca.pivot) +
  aes(x = sample, y = loadings, fill = group) + # other co-variates could be 'painted' on via the 'fill' aes
  geom_col() + # one colored bar per sample within each PC facet
  facet_wrap(~PC, labeller = labeller(PC = PC.labs)) +
  labs(title="PCA 'small multiples' plot",
       fill = "Life stage",
       caption=paste0("produced on ", Sys.time())) +
  scale_x_discrete(limits = targets$sample, labels = targets$group) +
  theme_bw() +
  theme(text = element_text(size = 5),
        title = element_text(size = 5))+
  coord_flip()


output.name <- paste0(x, '_PCAGraphs')
save(p4, p5, p6, p7,
     file = file.path(output.path,
                      output.name))
}
load(paste0("./Outputs/elegans_FilteringNormalizationGraphs"))
p1
p2
p3
p.discarded

DT::datatable(discarded.gene.df,
              rownames = FALSE,
                      escape = FALSE,
                      options = list(autoWidth = TRUE,
                                     scrollX = TRUE,
                                     scrollY = '300px',
                                     scrollCollapse = TRUE,
                                     
                                     searchHighlight = TRUE, 
                                     pageLength = 10, 
                                     lengthMenu = c("5",
                                                    "10",
                                                    "25",
                                                    "50",
                                                    "100"),
                                     initComplete = htmlwidgets::JS(
                                         "function(settings, json) {",
                                         paste0("$(this.api().table().container()).css({'font-size': '", "10pt", "'});"),
                                         "}")  
                      )) %>%
  DT::formatRound(columns=c(2:(ncol(discarded.gene.df)-10)), 
                        digits=3)
rm("p1", "p2", "p3", "p.discarded", "discarded.gene.df")
load(paste0("./Outputs/briggsae_FilteringNormalizationGraphs"))
p1
p2
p3
p.discarded
DT::datatable(discarded.gene.df,
              rownames = FALSE,
                      escape = FALSE,
                      options = list(autoWidth = TRUE,
                                     scrollX = TRUE,
                                     scrollY = '300px',
                                     scrollCollapse = TRUE,
                                     
                                     searchHighlight = TRUE, 
                                     pageLength = 10, 
                                     lengthMenu = c("5",
                                                    "10",
                                                    "25",
                                                    "50",
                                                    "100"),
                                     initComplete = htmlwidgets::JS(
                                         "function(settings, json) {",
                                         paste0("$(this.api().table().container()).css({'font-size': '", "10pt", "'});"),
                                         "}")  
                      )) %>%
  DT::formatRound(columns=c(2:(ncol(discarded.gene.df)-10)), 
                        digits=3)
rm("p1", "p2", "p3", "p.discarded", "discarded.gene.df")
load(paste0("./Outputs/brenneri_FilteringNormalizationGraphs"))
p1
p2
p3
p.discarded

DT::datatable(discarded.gene.df,
              rownames = FALSE,
                      escape = FALSE,
                      options = list(autoWidth = TRUE,
                                     scrollX = TRUE,
                                     scrollY = '300px',
                                     scrollCollapse = TRUE,
                                     
                                     searchHighlight = TRUE, 
                                     pageLength = 10, 
                                     lengthMenu = c("5",
                                                    "10",
                                                    "25",
                                                    "50",
                                                    "100"),
                                     initComplete = htmlwidgets::JS(
                                         "function(settings, json) {",
                                         paste0("$(this.api().table().container()).css({'font-size': '", "10pt", "'});"),
                                         "}")  
                      )) %>%
  DT::formatRound(columns=c(2:(ncol(discarded.gene.df)-10)), 
                        digits=3)
rm("p1", "p2", "p3", "p.discarded", "discarded.gene.df")
load(paste0("./Outputs/remanei_FilteringNormalizationGraphs"))
p1
p2
p3
p.discarded

DT::datatable(discarded.gene.df,
              rownames = FALSE,
                      escape = FALSE,
                      options = list(autoWidth = TRUE,
                                     scrollX = TRUE,
                                     scrollY = '300px',
                                     scrollCollapse = TRUE,
                                     
                                     searchHighlight = TRUE, 
                                     pageLength = 10, 
                                     lengthMenu = c("5",
                                                    "10",
                                                    "25",
                                                    "50",
                                                    "100"),
                                     initComplete = htmlwidgets::JS(
                                         "function(settings, json) {",
                                         paste0("$(this.api().table().container()).css({'font-size': '", "10pt", "'});"),
                                         "}")  
                      )) %>%
  DT::formatRound(columns=c(2:(ncol(discarded.gene.df)-10)), 
                        digits=3)
rm("p1", "p2", "p3", "p.discarded", "discarded.gene.df")
load(paste0("./Outputs/japonica_FilteringNormalizationGraphs"))
p1
p2
p3
p.discarded

DT::datatable(discarded.gene.df,
              rownames = FALSE,
                      escape = FALSE,
                      options = list(autoWidth = TRUE,
                                     scrollX = TRUE,
                                     scrollY = '300px',
                                     scrollCollapse = TRUE,
                                     
                                     searchHighlight = TRUE, 
                                     pageLength = 10, 
                                     lengthMenu = c("5",
                                                    "10",
                                                    "25",
                                                    "50",
                                                    "100"),
                                     initComplete = htmlwidgets::JS(
                                         "function(settings, json) {",
                                         paste0("$(this.api().table().container()).css({'font-size': '", "10pt", "'});"),
                                         "}")  
                      )) %>%
  DT::formatRound(columns=c(2:(ncol(discarded.gene.df)-10)), 
                        digits=3)
load(paste0("./Outputs/elegans_PCAGraphs"))
p4
p5
p6
p7
rm("p4", "p5", "p6", "p7")
load(paste0("./Outputs/briggsae_PCAGraphs"))
p4
p5
p6
p7
rm("p4", "p5", "p6", "p7")
load(paste0("./Outputs/brenneri_PCAGraphs"))
p4
p5
p6
p7
rm("p4", "p5", "p6", "p7")
load(paste0("./Outputs/remanei_PCAGraphs"))
p4
p5
p6
p7
rm("p4", "p5", "p6", "p7")
load(paste0("./Outputs/japonica_PCAGraphs"))
p4
p5
p6
p7

7 Appendix II: Session Info

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dendextend_1.17.1   gridExtra_2.3       factoextra_1.0.7   
##  [4] ggdendro_0.2.0      ggforce_0.4.2       knitr_1.45         
##  [7] gprofiler2_0.2.3    ggthemes_5.1.0      cowplot_1.1.3      
## [10] matrixStats_1.2.0   edgeR_4.0.16        limma_3.58.1       
## [13] magrittr_2.0.3      data.table_1.15.2   lubridate_1.9.3    
## [16] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
## [19] purrr_1.0.2         readr_2.1.5         tidyr_1.3.1        
## [22] tibble_3.2.1        ggplot2_3.5.0       tidyverse_2.0.0    
## [25] pacman_0.5.1        tximport_1.30.0     BiocManager_1.30.22
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0  viridisLite_0.4.2 farver_2.1.1      viridis_0.6.5    
##  [5] fastmap_1.1.1     lazyeval_0.2.2    tweenr_2.0.3      digest_0.6.34    
##  [9] timechange_0.3.0  lifecycle_1.0.4   ellipsis_0.3.2    statmod_1.5.0    
## [13] compiler_4.3.2    rlang_1.1.3       sass_0.4.8        tools_4.3.2      
## [17] utf8_1.2.4        yaml_2.3.8        ggsignif_0.6.4    labeling_0.4.3   
## [21] htmlwidgets_1.6.4 bit_4.0.5         abind_1.4-7       withr_3.0.0      
## [25] grid_4.3.2        polyclip_1.10-6   fansi_1.0.6       ggpubr_0.6.0     
## [29] colorspace_2.1-1  scales_1.3.0      MASS_7.3-60.0.1   cli_3.6.2        
## [33] rmarkdown_2.26    crayon_1.5.2      generics_0.1.3    rstudioapi_0.15.0
## [37] httr_1.4.7        tzdb_0.4.0        cachem_1.0.8      parallel_4.3.2   
## [41] vctrs_0.6.5       carData_3.0-5     jsonlite_1.8.8    car_3.1-2        
## [45] hms_1.1.3         bit64_4.0.5       ggrepel_0.9.5     rstatix_0.7.2    
## [49] crosstalk_1.2.1   locfit_1.5-9.9    plotly_4.10.4     jquerylib_0.1.4  
## [53] glue_1.7.0        DT_0.32           stringi_1.8.3     gtable_0.3.4     
## [57] munsell_0.5.0     pillar_1.9.0      htmltools_0.5.7   R6_2.5.1         
## [61] vroom_1.6.5       evaluate_0.23     lattice_0.22-5    highr_0.10       
## [65] backports_1.4.1   broom_1.0.5       bslib_0.6.1       Rcpp_1.0.12      
## [69] xfun_0.42         pkgconfig_2.0.3